Tone features for speech recognition

ABSTRACT

Robust acoustic tone features are obtained, first, by introducing an on-line, look-ahead trace back of the fundamental frequency (F₀) contour with adaptive pruning; this F₀ tracking serves as the signal pre-processing front-end. The F₀ contour is subsequently decomposed into a lexical tone effect, a phrase intonation effect, and a random effect by means of a time-variant, weighted moving average (MA) filter in conjunction with a weighted least squares fit of the F₀ contour (the weights placing more emphasis on vowels). The phrase intonation effect is removed from the F₀ contour by subtraction, under a superposition assumption. The acoustic tone features consist of two parts. The first part is the coefficients of the second-order weighted regression of the de-intonated F₀ contour over neighbouring frames. The second part describes the degree of periodicity of the signal: the coefficients of the second-order regression of the auto-correlation. The weights of the second-order weighted regression of the de-intonated F₀ contour are designed to emphasize the voiced and de-emphasize the unvoiced segments of the pitch contour, in order to preserve the voiced pitch contour for semi-voiced consonants.

The invention relates to automatic recognition of tonal languages, such as Mandarin Chinese.

Speech recognition systems, such as large vocabulary continuous speech recognition systems, typically use an acoustic/phonetic model and a language model to recognize a speech input pattern. Before recognizing the speech signal, the signal is spectrally and/or temporally analyzed to calculate a representative vector of features (observation vector, OV). Typically, the speech signal is digitized (e.g. sampled at a rate of 6.67 kHz) and pre-processed, for instance by applying pre-emphasis. Consecutive samples are grouped (blocked) into frames, corresponding to, for instance, 20 or 32 msec. of speech signal. Successive frames partially overlap by, for instance, 10 or 16 msec., respectively. Often the Linear Predictive Coding (LPC) spectral analysis method is used to calculate for each frame a representative vector of features (observation vector). The feature vector may, for instance, have 24, 32 or 63 components. The acoustic model is then used to estimate the probability of a sequence of observation vectors for a given word string. For a large vocabulary system, this is usually performed by matching the observation vectors against an inventory of speech recognition units. A speech recognition unit is represented by a sequence of acoustic references. As an example, a whole word or even a group of words may be represented by one speech recognition unit. Linguistically based sub-word units are also used, such as phones, diphones or syllables, as well as derivative units, such as fenenes and fenones. For sub-word based systems, a word model is given by a lexicon, describing the sequence of sub-word units relating to a word of the vocabulary, and the sub-word models, describing sequences of acoustic references of the involved speech recognition unit. The (sub-)word models are typically based on Hidden Markov Models (HMMs), which are widely used to stochastically model speech signals.

The observation vectors are matched against all sequences of speech recognition units, providing the likelihoods of a match between the vector and a sequence. If sub-word units are used, the lexicon limits the possible sequence of sub-word units to sequences in the lexicon. A language model places further constraints on the matching so that the paths investigated are those corresponding to word sequences which are proper sequences as specified by the language model. Combining the results of the acoustic model with those of the language model produces a recognized sentence.

Most existing speech recognition systems have been primarily developed for Western languages, like English or German. Since the tone of a word in Western languages does not influence the meaning, the acoustic realization of tone reflected in a pitch contour is considered as noise and disregarded. The feature vector and acoustic model do not include tone information. For so-called tonal languages, like Chinese, tonal information influences the meaning of the utterance. Lexical tone pronunciation plays a part in the correct pronunciation of Chinese characters and is reflected by acoustic evidence such as a pitch contour. For example, the language spoken most world-wide, Mandarin Chinese, has five different tones (prototypic within-syllable pitch contours), commonly characterized as “high” (flat fundamental frequency F₀ contour), “rising” (rising F₀ contour), “low-rising” (a low contour, either flat or dipping), “falling” (falling contour, possibly from high F₀), and “neutral” (neutral, possibly characterized by a small, short falling contour from low F₀). In continuous speech, the low-rising tone may be considered just a “low” tone. The same syllable pronounced with different tones usually has entirely different meanings. Mandarin Chinese tone modeling, intuitively, is based on the fact that people can recognize the lexical tone of a spoken Mandarin Chinese character directly from the pattern of the voiced fundamental frequency.

Thus, it is desired to use lexical tone information as one of the knowledge sources when developing a high-accuracy tonal language speech recognizer. To integrate tone modeling, it is desired to determine suitable features to be incorporated in the existing acoustic model or in an additional tone model. It is already known to use the pitch (fundamental frequency, F₀) or log pitch as a component in a tone feature vector. Tone feature vectors typically also include first (and optionally second) derivatives of the pitch. In multi-pass systems, energy and duration information is often also included in the tone feature vector. Measurement of pitch has been a research topic for decades. A common problem of basic pitch-detection algorithms (PDAs) is the occurrence of multiple/sub-multiple gross pitch errors. Such errors distort the pitch contour. In a classical approach to Mandarin tone models the speech signal is analyzed to determine if it is voiced or unvoiced. A pre-processing front-end must estimate pitch reliably without introducing multiple/sub-multiple pitch errors. This is mostly done either by fine-tuning thresholds between multiple pitch errors and sub-multiple pitch errors, or by local constraints on possible pitch movements. Typically, the pitch estimate is improved by maximizing the similarity inside the speech signal in order to be robust against multiple/sub-multiple pitch errors via smoothing, e.g. a median filter, together with prior knowledge of the reasonable pitch range and movement. The lexical tone of every recognized character or syllable is decoded independently by stochastic HMMs. This approach has many defects. A lexical tone exists only on the voiced segments of Chinese characters and it is therefore desired to extract pitch contours for the voiced segments of speech. However, it is notoriously difficult to make a voiced-unvoiced decision for a segment of speech. A voiced/unvoiced decision cannot be determined reliably at the pre-processing front-end level.

A further drawback is that the smoothing coefficients (thresholds) of the smoothing filter are quite corpus dependent. In addition, the architecture of this type of tone model is too complex to be applied in real-time, large vocabulary dictation systems, which nowadays are mainly executed on a personal computer. To overcome multiple/sub-multiple pitch errors, the dynamic programming (DP) technique has also been used in conjunction with the knowledge of continuity characteristics of pitch contours. However, the utterance-based nature of plain DP prohibits its use in online systems.

It is an object of the invention to improve the extraction of tone features from a speech signal. It is a further object to define components, other than pitch, for a speech feature vector suitable for automatic recognition of speech spoken in a tonal language.

To improve the extraction of tone features, the following algorithmic improvements are introduced:

A two-step approach to pitch extraction:

At low resolution, a pitch contour is determined, preferably in the frequency domain.

At high resolution, fine tuning occurs, preferably in the time domain, by maximization of the normalized correlation inside the quasi-periodic signal in an analysis window that contains more than one complete pitch period.

The low-resolution pitch contour determination preferably includes:

Determining pitch information based on a similarity measure inside the speech signal, preferably based on subharmonic summation in the frequency domain.

Using dynamic programming (DP) to eliminate multiple and sub-multiple pitch errors.

The dynamic programming preferably includes:

Adaptive beam-pruning for efficiency,

Fixed-length partial traceback for guaranteeing a maximum delay, and

Bridging unvoiced and silence segments.

These improvements may be used in combination or in isolation, combined with conventional techniques.

To improve the feature vector, the speech feature vector includes a component representing an estimated degree of voicing of the speech segment to which the feature vector relates. In a preferred embodiment, the feature vector also includes a component representing the first or second derivative of the estimated degree of voicing. In an embodiment, the feature vector includes a component representing a first or second derivative of an estimated pitch of the segment. In an embodiment the feature vector includes a component representing the pitch of the segment. Preferably, the pitch is normalized by subtracting the average neighborhood pitch to eliminate speaker and phrase effects. Advantageously, the normalization is based on using the degree of voicing as a weighting factor. It will be appreciated that a vector component may include the involved parameter itself or any suitable measure, like a log, of the parameter.

It should be noted that a simplified Mandarin tone model has also been used. In such a model a pseudo pitch is created by interpolation/extrapolation from voiced to unvoiced segments, since a voiced/unvoiced decision cannot be determined reliably. Knowledge of a degree of voicing has not been put to practical use. Ignoring the knowledge of the degree of voicing is undesirable, since the degree of voicing is a knowledge source that certainly improves recognition. For instance, the movement of pitch is quite slow (1% per 1 ms) in voiced segments, but jumps quickly in voiced-unvoiced or unvoiced-voiced segments. The system according to the invention exploits the knowledge of the degree of voicing.

These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments shown in the drawings.

FIG. 1 illustrates a three-stage extraction of tone features;

FIG. 2 shows an example pitch contour and degree of voicing;

FIGS. 3A and 3B illustrate the use of weighted filtering; and

FIG. 4 is a block diagram of the system according to the present invention.

The speech processing system according to the invention may be implemented using conventional hardware. For instance, a speech recognition system may be implemented on a computer, such as a PC, where the speech input is received via a microphone and digitized by a conventional audio interface card. All additional processing takes place in the form of software procedures executed by the CPU. In particular, the speech may be received via a telephone connection, e.g. using a conventional modem in the computer. The speech processing may also be performed using dedicated hardware, e.g. built around a DSP. Since speech recognition systems are generally known, only details relevant for the invention are described here in more detail. Details are mainly given for the Mandarin Chinese language. A person skilled in the art can easily adapt the techniques shown here to other tonal languages.

FIG. 1 illustrates three independent processing stages to extract tone features of an observation vector ō(t) from a speech signal s(n). The invention offers improvements in all three areas. Preferably, the improvements are used in combination. However, they can also be used independently, where conventional technology is used for the other stages. In the first stage a periodicity measure (pitch) is determined. To this end, the incoming speech signal s(n) is divided into overlapping frames with preferably a 10 msec. shift. For every frame at time t a measure p(f, t) for a range of frequencies f is determined, expressing how periodic the signal is for the frequency f. As will be described in more detail below, preferably the subharmonic summation (SHS) algorithm is used to determine p(f, t). The second stage introduces continuity constraints to increase robustness. Its output is a sequence of raw pitch-feature vectors, which consist of the actual pitch estimate $\hat{F}_0(t)$ and the corresponding degree of voicing $v(\hat{F}_0(t), t)$ (advantageously a normalized short-time autocorrelation is used as a measure of the degree of voicing). Preferably, the continuity constraints are applied using dynamic programming (DP), as will be described in more detail below. In the third stage, labeled FEAT, post-processing and normalization operations are performed and the actual sequence of tone features of the vector ō(t) is derived. Details will be provided below.

Periodicity Measure

A preferred method for determining pitch information will now be described. The speech signal may be received in analogue form. If so, an A/D converter may be used to convert the speech signal into a sampled digital signal. Information on the pitch for possible fundamental frequencies F₀ in the range of physical vibration of the human vocal cords is extracted from the digitized speech signal. Next, a measure of the periodicity is determined. Most pitch detection algorithms are based on maximizing a measure like p(f, t) over the expected F₀ range. In the time domain, such measures are typically based on the signal's auto-correlation function $r_{s_t s_t}(1/f)$ or a distance measure (like AMDF). According to the invention, the subharmonic summation (SHS) algorithm is used, which operates in the frequency domain and provides the sub-harmonic sum as a measure. The digital sampled speech signal is sent to the robust tone feature extraction front-end, where the sampled speech signal is, preferably, first low-pass filtered with a cut-off frequency below 1250 Hz. In a simple implementation, the low-pass filter can be implemented as a moving average FIR filter. Next, the signal is segmented into a number of analysis gates, equal in width and overlapping in time. Every analysis gate is multiplied (“windowed”) by a kernel commonly used in speech analysis, the Hamming window, or an equivalent window. The analysis window must contain at least one complete pitch period. A reasonable range of the pitch period τ is 1/350 s ≈ 2.86 ms ≤ τ ≤ 1/50 s = 20 ms. So, preferably the window length is at least 20 ms.
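By way of illustration only, the segmentation into overlapping, Hamming-windowed analysis gates may be sketched as follows. The sketch assumes a sampling rate of 8 kHz, a 20 ms window and a 10 ms shift; the function names `hamming` and `windowed_frames` are hypothetical and not part of the described system, and the low-pass filtering step is omitted.

```python
import math


def hamming(n_len):
    """Standard Hamming window of n_len samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_len - 1))
            for n in range(n_len)]


def windowed_frames(s, f_sample, win_ms=20.0, shift_ms=10.0):
    """Cut s into equal-width, overlapping analysis gates and window each one."""
    win = int(round(f_sample * win_ms / 1000.0))      # samples per gate
    shift = int(round(f_sample * shift_ms / 1000.0))  # samples between gate starts
    w = hamming(win)
    return [[s[start + n] * w[n] for n in range(win)]
            for start in range(0, len(s) - win + 1, shift)]


# One second of a constant signal at an assumed 8 kHz sampling rate.
frames = windowed_frames([1.0] * 8000, 8000)
```

With these assumed settings each gate holds 160 samples, i.e. one complete pitch period even at the lowest pitch of 50 Hz.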

A representation of the sampled speech signal in an analysis gate (also referred to as segment or frame) is then calculated, preferably using the Fast Fourier transform (FFT), to generate the spectrum. The spectrum is then squared to yield the power spectrum. Preferably, the peaks of the amplitude spectrum are enhanced for robustness. The power spectrum is then preferably smoothed by a triangular kernel (advantageously with low-pass filter coefficients ¼, ½, ¼) to yield the smoothed amplitude spectrum. Next, it is preferred to apply cubic spline interpolation of $I_{resolution}$ points (preferably no more than 16 equidistant points per octave, at low frequency resolution, for quickly finding the correct route) on the kernel-smoothed amplitude spectrum. Auditory sensitivity compensation on the spline-interpolated power spectrum is preferably performed by an arc-tangent function on the logarithmic frequency scale:

$A(\log_2 f) = 0.5 + \frac{\tan^{-1}(3.0 \cdot \log_2 f)}{\pi}$

For the possible fundamental frequencies F₀ in the range of physical vibration of the human vocal cords, subharmonic summation is then applied to yield the pitch information:

$\sum_{k=1}^{15} w_k \cdot P(\log_2(kf)) \cdot I(kf < 1250), \qquad w_k = c^{k-1}, \quad \forall k = 1, 2, \ldots, N_{subharmonics}$

where $P(\log_2 f) = C(\log_2 f) \cdot A(\log_2 f)$, $C(\log_2 f)$ is the spline interpolation of $S(\log_2 f)$, the power spectrum from the FFT, and c is the noise compensation factor. Advantageously, for microphone input c=0.84 and for telephone input c=0.87. f is the pitch (in Hz), 50 ≤ f ≤ 350. The SHS algorithm is described in detail in D. Hermes, “Measurement of pitch by subharmonic summation”, J. Acoust. Soc. Am. 83 (1), January 1988, hereby included by reference. Here only a summary of SHS is given. Let $s_t(n)$ represent the incoming speech signal windowed at frame t and let $S_t(f)$ be its Fourier transform. Conceptually, the fundamental frequency is determined by computing the energy $E_f$ of $s_t(n)$ projected onto the sub-space of functions periodic with f:

$E_f = \sum_{n=-\infty}^{\infty} \left| S'_t(nf) \right|^2$

and maximizing with respect to f. In the actual SHS method described by Hermes, various refinements are introduced: the peak-enhanced amplitude spectrum |S′_t| is used instead, weighted by a filter W(f) representing the sensitivity of the auditory system, and the lower harmonics are emphasized by weighting with weights h_i. This is efficiently realized by means of the Fast Fourier Transform, interpolation, and superposition on a logarithmic scale, arriving at:

$p(f,t) = \sum_{n=1}^{N} h_1^{n-1} \left( \left| S'_t(nf) \right| W(nf) \right)$

In this equation, N represents the number of harmonics.
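The harmonic summation can be illustrated with a deliberately simplified sketch. It assumes a power spectrum already available on a linear frequency grid, replaces the cubic spline by linear interpolation, and leaves out peak enhancement and the auditory weighting; only the decaying harmonic weights (c=0.84, as given for microphone input), the 1250 Hz limit and the 50-350 Hz search range are taken from the text. The names `subharmonic_sum` and `best_pitch` and the toy spectrum are hypothetical.

```python
def subharmonic_sum(power, bin_hz, f, c=0.84, n_harmonics=15, f_max=1250.0):
    """Simplified p(f, t): sum of c**(k-1) * P(k*f) over harmonics below f_max."""
    total = 0.0
    for k in range(1, n_harmonics + 1):
        kf = k * f
        if kf >= f_max:              # indicator I(kf < 1250)
            break
        pos = kf / bin_hz            # fractional bin index of the k-th harmonic
        lo = int(pos)
        if lo + 1 >= len(power):
            break
        frac = pos - lo
        # Linear interpolation stands in for the cubic spline (assumption).
        p = (1.0 - frac) * power[lo] + frac * power[lo + 1]
        total += c ** (k - 1) * p
    return total


def best_pitch(power, bin_hz, f_lo=50.0, f_hi=350.0, step=1.0, c=0.84):
    """Frame-wise estimate: the candidate f that maximises the subharmonic sum."""
    candidates = []
    f = f_lo
    while f <= f_hi:
        candidates.append(f)
        f += step
    return max(candidates, key=lambda cand: subharmonic_sum(power, bin_hz, cand, c))


# Toy spectrum: 5 Hz bins with unit peaks at the harmonics of 100 Hz.
bin_hz = 5.0
toy_power = [0.0] * 400
for h in range(100, 1250, 100):
    toy_power[int(h / bin_hz)] = 1.0
```

On this toy spectrum the candidate at 100 Hz collects every harmonic peak with the largest weights, whereas the sub-multiple at 50 Hz only reaches the peaks through its even, more strongly attenuated harmonics.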

Continuity Constraints

A straightforward estimate of the pitch is given by $\hat{F}_0(t) = \arg\max_f p(f,t)$. However, due to the lack of continuity constraints across frames, it is prone to so-called multiple and sub-multiple pitch errors, most prevalent in the telephone corpus due to broadband channel noise. According to the invention, the principle of dynamic programming is used to introduce continuity (in the voiced segments of speech). As such, pitch is not estimated in isolation. Instead, by considering the neighboring frames, pitch is estimated along a path of globally minimal error. Based on the continuity characteristic of pitch in voiced segments of speech, pitch varies within a limited range (around 1%/msec.). This information can be utilized to avoid multiple/sub-multiple pitch errors. Using dynamic programming ensures that the pitch estimation follows the correct route. It should be realized that pitch changes dramatically on the voiced-unvoiced segments of speech. Moreover, a full search scheme for a given path boundary is time-consuming (due to its unnecessarily long processing delay), which makes it almost impossible to implement in a real-time system for pitch tracking with subjectively high tone quality. These drawbacks are overcome as will be described in more detail below.

Dynamic Programming

The continuity constraint can be included by formulating pitch detection as:

$\hat{F}_0(1 \ldots T) = \underset{F_0(1 \ldots T)}{\arg\max} \sum_{t=1}^{T} p(F_0(t), t) \cdot a_{F_0(t) | F_0(t-1)} \qquad (1)$

where $a_{f_2|f_1}$ penalizes or forbids rapid changes of pitch. By quantizing F₀, this criterion can be solved by dynamic programming (DP).
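A minimal sketch of solving criterion (1) by dynamic programming over a quantized pitch axis is given below. It is a plain full-search version without pruning or partial traceback. The transition weight used here (1 for a pitch ratio of at most 1.1 between consecutive frames, 0 otherwise) and the treatment of the first frame (no transition term) are assumptions made for illustration, as are the function name `dp_pitch_track` and the toy scores.

```python
def dp_pitch_track(p, freqs, max_ratio=1.1):
    """Maximise sum_t p[t][k_t] * a(k_t | k_{t-1}) over quantised pitch paths."""
    n = len(freqs)

    def a(f_new, f_old):
        # Assumed transition weight: small moves allowed, large jumps earn nothing.
        ratio = max(f_new, f_old) / min(f_new, f_old)
        return 1.0 if ratio <= max_ratio else 0.0

    score = list(p[0])   # first frame: no predecessor, so no transition term
    back = []            # back[t-1][k] = best predecessor of candidate k at frame t
    for t in range(1, len(p)):
        new_score, ptr = [], []
        for k in range(n):
            ext = [score[j] + p[t][k] * a(freqs[k], freqs[j]) for j in range(n)]
            best_j = max(range(n), key=lambda j: ext[j])
            new_score.append(ext[best_j])
            ptr.append(best_j)
        score = new_score
        back.append(ptr)

    # Full traceback from the best final candidate.
    k = max(range(n), key=lambda i: score[i])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return [freqs[i] for i in path]


# Toy periodicity scores: frame 1 has a spurious peak at the octave (200 Hz).
freqs = [100.0, 105.0, 200.0]
p = [[1.0, 0.2, 0.1],
     [0.9, 0.3, 1.0],
     [1.0, 0.2, 0.1]]
framewise = [freqs[max(range(3), key=lambda k: row[k])] for row in p]
tracked = dp_pitch_track(p, freqs)
```

In this example the frame-wise maximum jumps to 200 Hz in the middle frame, while the DP path stays at 100 Hz because the octave jump contributes nothing to the accumulated score.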

In many systems, the pitch value is set to 0 in silence and unvoiced regions. This leads to problems with zero variances and undefined derivatives at the voiced/unvoiced boundaries. It is known to “bridge” these regions by exponentially decaying the pitch towards the running average. Advantageously, DP provides an effective way of bridging unvoiced and silence regions. It leads to “extrapolation” of a syllable's pitch contour (located in the syllable's main vowel) backwards in time into its initial consonant. This was found to provide additional useful information to the recognizer.

Partial Traceback

The fact that equation (1) requires processing all T frames of an utterance before the pitch contour can be decided renders it less suitable for online operation. According to the invention, a partial traceback is performed, exploiting the path merging property of DP. In itself the technique of back tracing is well-known from Viterbi decoding during speech recognition. Therefore, no extensive details are given here. It is preferred to use a fixed-length partial traceback that guarantees a maximum delay: at every frame t, the local best path is determined and traced back ΔT₁ frames. If ΔT₁ is large enough, the so-determined pitch $\hat{F}_0(t - \Delta T_1)$ can be expected to be reliable. Experiments show that the delay can be limited to around 150 msec., which is short enough to avoid any noticeable delay for the user.
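The fixed-length partial traceback can be sketched as follows, assuming that the DP keeps one list of back-pointers per frame. The data layout and the name `partial_traceback` are hypothetical; with a 10 msec. frame shift, a delay of about 150 msec. would correspond to ΔT₁ = 15 frames.

```python
def partial_traceback(back, best_now, delta_t1):
    """Follow back-pointers delta_t1 frames back from the locally best state.

    back[i][k] is the predecessor of state k between frame i and frame i + 1,
    so back[-1] belongs to the current frame t. Returns the state index at
    frame t - delta_t1 on the locally best path.
    """
    if delta_t1 <= 0:
        return best_now
    if delta_t1 > len(back):
        raise ValueError("not enough history for the requested traceback")
    k = best_now
    for ptr in reversed(back[-delta_t1:]):
        k = ptr[k]
    return k


# Toy history of back-pointers for three transitions over three states.
history = [[0, 0, 2],
           [0, 0, 0],
           [1, 1, 1]]
```

Only the state reached after ΔT₁ steps is emitted at each frame, so the output lags the input by a fixed number of frames regardless of the utterance length.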

Beam Pruning

In the above form, path recombinations constitute the major portion of the CPU effort. To reduce this effort, beam pruning is used. In itself beam pruning is also well-known from speech recognition and will not be described in full detail here. For every frame, only a subset of paths promising to lead to the global optimum is considered. Paths with scores sc(t) for which

$\frac{sc(t) - sc_{opt}(t - \Delta T_2)}{sc_{opt}(t) - sc_{opt}(t - \Delta T_2)} < \text{threshold}$

are discontinued ($sc_{opt}(\tau)$ = local best score at time τ).
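The pruning test can be written as a small predicate. The sketch below assumes a threshold of 0.999 and, as an additional assumption not stated in the criterion itself, keeps every path when the best path has gained no score over the last ΔT₂ frames (for instance in pure silence); the name `is_pruned` is hypothetical.

```python
def is_pruned(sc, sc_opt_now, sc_opt_past, threshold=0.999):
    """Return True if a path with score sc should be discontinued.

    sc_opt_now  -- local best score at the current frame t
    sc_opt_past -- local best score at frame t - delta_t2
    """
    gain_best = sc_opt_now - sc_opt_past
    if gain_best <= 0.0:
        # Assumption: nothing was gained over the window, so do not prune.
        return False
    return (sc - sc_opt_past) / gain_best < threshold
```

For example, with a best score that rose from 5.0 to 10.0 over the window, a path at 9.0 has gained only 80% of that amount and is discontinued, while the best path itself is kept.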

Since efficiency is a major concern, as much pruning as possible is preferred without damaging quality. In the dynamic programming step, dramatic changes in the estimated pitch remain in the voiced-unvoiced segments of speech, even after applying the dynamic programming technique. This is because in a pure silence region there is no periodicity information: all possible pitch values are equally likely. Theoretically, no pruning is necessary at this point. On the other hand, in a pure speech region there is a lot of periodicity information, and the distribution of pitch has many peaks at the multiples/sub-multiples of the correct pitch. At this point, pruning paths which have a very low accumulated score is appropriate. The pruning criteria preferably also consider the effect of silence. If at the beginning of a sentence there exists a silence region of more than approximately 1.0 sec., pruning should preferably not take place. Experiments have shown that pruning paths whose ‘so far’ accumulated score is less than 99.9% of the ‘so far’ highest accumulated score results in losing the correct pitch route. On the other hand, pruning paths whose score accumulated ‘from 0.50 s ago to so far’ is less than 99.9% of the highest score accumulated ‘from 0.50 s ago to so far’ keeps the correct route and saves up to 96.6% of the loop consumption compared to a full search scheme.

Reduction of Resolution

The number of path recombinations is proportional to the square of the DP's frequency resolution. Significant speed-up can be achieved by reducing the resolution of the frequency axis in DP. A lower resolution limit is observed at around 50 quantization steps per octave. Below that, the DP path becomes inaccurate. It has been found that the limit can be lowered further by a factor of three, if each frame's pitch estimate $\hat{F}_0(t)$ is fine-tuned after DP in the vicinity of the rough path. Preferably this is done by maximizing v(f, t) at higher resolution within the quantization step Q(t) from the low-resolution path, i.e.:

$\hat{F}_0(t) = \arg\max_{f \in Q(t)} v(f,t)$

A preferred method for the maximization of the look-ahead, local likelihood of the F₀ with adaptive pruning using the present invention will now be described. In summary, the following steps occur:

Calculating the transition scores of every possible pitch movement in the voiced segments of speech.

Calculating the current value of the maximal sub-harmonic summation and the ‘so far’ accumulated path scores.

Determining the adaptive pruning based on a certain history (lookback of length M) of the ‘so far’ best path and calculating the adaptive pruning threshold, then performing path extension based on the degree of periodicity and pruning based on the adaptive pruning threshold.

Tracing back from a certain time frame (look-ahead trace back of length N) to the current time frame and outputting only the current time frame as the stable rough pitch estimate.

Performing a high-resolution, fine search in the neighborhood of the stable rough pitch estimate to estimate the precise pitch, and outputting the precise pitch as the final result of the look-ahead adaptive pruning trace back procedure.

In more detail the following occurs. Pitch information is first processed by calculating the transition probability of every possible pitch movement in the voiced segments of speech, where pitch movement is preferably measured on the ERB auditory sensitivity scale. The calculation of transition scores can be done as follows:

PitchMovementScore[k][j] = (1 − (PitchMove/MaxMove)*(PitchMove/MaxMove)), where PitchMove and MaxMove are measured on the ERB auditory sensitivity scale.

The movement of pitch will not exceed 1% per 1 ms in voiced segments [5]. For a male speaker F₀ is around 50-120 Hz, for a female speaker F₀ is around 120-220 Hz; the average F₀ is around 127.5 Hz. The conversion from Hz to Erb is:

$\mathrm{Erb}(f) = 21.4 \cdot \log_{10}\left(1 + \frac{f}{230}\right)$

MaxMove (in Hz) is 12.75 Hz within 10 ms (10% of the average F₀ of 127.5 Hz), corresponding to 0.5 Erb within 10 ms.
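The conversion and the transition score can be sketched as follows. The function names `hz_to_erb` and `pitch_movement_score` are hypothetical; the conversion formula and the MaxMove of 0.5 Erb per 10 ms frame are the ones stated above.

```python
import math


def hz_to_erb(f_hz):
    """Hz to Erb conversion as given in the text."""
    return 21.4 * math.log10(1.0 + f_hz / 230.0)


def pitch_movement_score(f_from_hz, f_to_hz, max_move_erb=0.5):
    """1 - (PitchMove / MaxMove)**2, with both moves measured in Erb."""
    move = abs(hz_to_erb(f_to_hz) - hz_to_erb(f_from_hz))
    ratio = move / max_move_erb
    return 1.0 - ratio * ratio


small_move = pitch_movement_score(120.0, 122.0)   # gentle movement within one frame
octave_jump = pitch_movement_score(100.0, 200.0)  # typical multiple pitch error
```

A small movement scores close to 1, whereas an octave jump yields a strongly negative score and would therefore not be extended.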

Next, the current value of the maximal sub-harmonic summation is calculated, as well as the ‘so far’ (from the beginning of the speech signal to the current time frame) accumulated path scores. The ‘so far’ accumulated path scores can be calculated using the following recursive formula: AccumulatedScores[j][frame−1] + PitchMovementScore[k][j] * CurrentSHS[k][frame];

Path extension only occurs for those possible pitch movements with a transition score greater than (preferably) 0.6; path extensions with a transition score less than or equal to 0.6 are skipped. Preferably, adaptive pruning is based on the accumulated path scores within a history of (advantageously) 0.5 second; this reference is denoted as the ReferenceAccumulatedScore. In addition or alternatively, adaptive pruning is based on the degree of voicing, using the following decision criteria:

Prune tightly on a path if the accumulated path score within a history of, for instance, 0.5 second is less than 99.9% of the maximal accumulated path score within the same history and there is much periodicity information at the current time frame. Expressed in a formula: if (AccumulatedScores[j][frame−1] − ReferenceAccumulatedScore) is less than 99.9% of (MaxAccumulatedScores[frame−1] − ReferenceAccumulatedScore) and there is much periodicity information at the current time frame (e.g., CurrentSHS[j][frame] ≥ 80.0% of CurrentMaxSHS[frame]).

Prune loosely on a path if there is only little, vague pitch information, i.e. little periodicity information, at the current time frame. Loose pruning is performed by extending the previous path to the current most probable, maximal and minimal pitch movements. This is needed because the beginning of a sentence mostly consists of silence, so that the accumulated path scores are too small to prune tightly; this differs from the situation between the beginning of the sentence and the voiced-unvoiced segments.

The high-resolution, fine pitch search in the neighborhood of the stable rough pitch estimate, for estimating the precise pitch, uses a cubic spline interpolation on the correlogram. This can significantly reduce the active states in the look-ahead adaptive pruning trace back of the F₀ without a trade-off in accuracy. The high-resolution, fine pitch search at high frequency resolution (for high pitch quality) uses maximization of the normalized correlation inside the quasi-periodic signal in an analysis window that contains more than one complete pitch period.

The default window length is two times the maximal complete pitch period:

$f_0 \geq 50\ \mathrm{Hz}, \quad \text{pitch period} \leq \frac{1}{50}\ \mathrm{s} = 0.020\ \mathrm{s}, \quad \text{window length} = 2 \cdot 0.020\ \mathrm{s} = 40\ \mathrm{ms}$

Using the look-ahead adaptive pruning trace back of the F₀ has the advantage that it is almost free from the multiple or sub-multiple pitch errors which exist in many pitch detection algorithms based on peak-picking rules. Experiments have shown that both the tone error rate (TER) and the character error rate (CER) are reduced significantly compared to heuristic peak-picking rules. Additionally, it improves accuracy without sacrificing efficiency, since it looks ahead 0.20 s and adaptively prunes many unnecessary paths based on the pitch information, whether voiced or unvoiced.

Features for Mandarin Speech Recognition

Referring to the five Mandarin lexical tones, the first (high) and third (low) tone mainly differ in pitch level, whereas the pitch derivative is close to zero. On the contrary, the second (rising) and fourth (falling) tone span a pitch range, but with a clear positive or negative derivative. Thus, both pitch and its derivative are candidate features for tone recognition. The potential of curvature information (2nd derivative) is less clear.

According to the invention, the degree of voicing v(f, t) and/or its derivative are represented in the feature vector. Preferably the degree of voicing is represented by a measure of a (preferably normalized) short-time auto-correlation, as expressed by the regression coefficients of the second-order regression of the auto-correlation contour. This can be defined as:

$v(f,t) = \frac{\sum_{n=N_1(t)}^{N_2(t)} s(n) \cdot s\left(n - \frac{f_{sample}}{f}\right)}{\left( \sum_{n=N_1(t)}^{N_2(t)} s^2(n) \cdot \sum_{n=N_1(t)}^{N_2(t)} s^2\left(n - \frac{f_{sample}}{f}\right) \right)^{\frac{1}{2}}} \leq 1$
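The normalized short-time auto-correlation can be sketched directly from this definition. The sketch assumes that the lag f_sample/f is rounded to a whole number of samples and that N₁(t) is at least as large as that lag; the function name `degree_of_voicing` and the test signal (a 100 Hz sine at an assumed 8 kHz sampling rate) are hypothetical.

```python
import math


def degree_of_voicing(s, f, f_sample, n1, n2):
    """Normalised auto-correlation of s at lag f_sample / f over n1..n2 (inclusive)."""
    lag = int(round(f_sample / f))   # assumption: lag rounded to whole samples
    num = e0 = e1 = 0.0
    for n in range(n1, n2 + 1):
        num += s[n] * s[n - lag]
        e0 += s[n] ** 2
        e1 += s[n - lag] ** 2
    if e0 == 0.0 or e1 == 0.0:
        return 0.0                   # pure silence: no periodicity evidence
    return num / math.sqrt(e0 * e1)


f_sample = 8000
sine = [math.sin(2.0 * math.pi * 100.0 * n / f_sample) for n in range(400)]
v_true = degree_of_voicing(sine, 100.0, f_sample, 80, 399)   # lag = one period
v_wrong = degree_of_voicing(sine, 160.0, f_sample, 80, 399)  # lag = 0.625 period
```

At the true pitch the measure is essentially 1, while at a mismatched candidate it drops well below that, which is what makes it usable both as a voicing feature and for the fine pitch search.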

Using the degree of voicing as a feature assists in syllable segmentation and in disambiguating voiced and unvoiced consonants. It has been verified that the maximal correlation of the speech signal can be used as a reliable measure of the pitch estimate (refer to the following table). This is partially due to the fact that maximal correlation is a measure of periodicity. Including this feature provides information on the degree of periodicity in the signal, thus improving the recognition accuracy.

Table: Estimated probability of (sub-)multiple pitch errors between SHS and PDT

Threshold (corresponding correlation of the pitch estimates):   0.52      0.80     0.92
Global error rate (conditioned on the correlation threshold):   16.734%   4.185%   1.557%

Energy and its derivative(s) may also be taken as tone features, but since these components are already represented in the spectral feature vector, they are not considered here any further.

The tone features consist of two parts. The first part is the regression coefficients of the second-order weighted regression of the de-intonated F₀ contour over neighboring frames, with a window size related to the average length of a syllable and weights corresponding to the degree of periodicity of the signal. The second part describes the degree of periodicity of the signal: the regression coefficients of the second-order regression of the auto-correlation contour, with a window size related to the average length of a syllable and the correlation lag corresponding to the reciprocal of the pitch estimate from the look-ahead trace back procedure.
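A second-order weighted regression over a window of frames can be sketched as a weighted least squares fit of a parabola. The sketch below is an illustration under stated assumptions: the frame index is centred on the middle of the window, the weights are taken as given (for instance degree-of-voicing values), at least three frames carry non-zero weight, and the normal equations are solved by plain Gauss-Jordan elimination; the name `weighted_quadratic_fit` is hypothetical.

```python
def weighted_quadratic_fit(y, w):
    """Weighted least squares fit y[i] ~ b0 + b1*tau + b2*tau**2.

    tau runs over the window centred on its middle frame; returns [b0, b1, b2].
    """
    half = len(y) // 2
    taus = [i - half for i in range(len(y))]
    # Augmented normal equations (X^T W X | X^T W y) for the basis (1, tau, tau^2).
    m = [[0.0] * 4 for _ in range(3)]
    for tau, yi, wi in zip(taus, y, w):
        basis = (1.0, float(tau), float(tau * tau))
        for r in range(3):
            for c in range(3):
                m[r][c] += wi * basis[r] * basis[c]
            m[r][3] += wi * basis[r] * yi
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, 4):
                    m[r][c] -= factor * m[col][c]
    return [m[r][3] / m[r][r] for r in range(3)]


# Toy contour that is exactly quadratic, with voicing-like weights.
contour = [2.0 + 0.5 * t - 0.25 * t * t for t in range(-3, 4)]
weights = [0.1, 0.5, 1.0, 1.0, 1.0, 0.5, 0.1]
b0, b1, b2 = weighted_quadratic_fit(contour, weights)
```

The three coefficients describe the level, slope and curvature of the contour within the window; frames with low weight, such as unvoiced ones, have correspondingly little influence on them.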

Long-term Pitch Normalization

In itself, using pitch as a tone feature may in fact degrade recognition performance. This is because a pitch contour is a superposition of:

a) the speaker's base pitch,

b) the sentence-level prosody,

c) the actual tone, and

d) statistical variation.

While (c) is the desired information and (d) is handled by the HMM, (a) and (b) are irrelevant for tone recognition, but their variation exceeds the difference between first and third tone. This is illustrated in FIG. 3 for an example pitch contour representing spoken sentence 151 of the 863 male test set. In this sentence, the pitch levels of first and third tone become indistinguishable due to sentence prosody. Within the sentence, the phrase component already spans a range of 50 Hz, whereas the pitch of an adult speaker may range from 100 to 300 Hz. The upper part of FIG. 3 shows the pitch contour, where the dotted line denotes the (estimated) phrase component. The thick lines denote the areas with a voicing degree above 0.6. The lower part of FIG. 3 shows the corresponding degree of voicing.

It has been proposed to apply “cepstral mean subtraction” to the log pitch to obtain gender-independent pitch contours. While this effectively removes the speaker bias (a), the phrase effect (b) is not accounted for.

According to the invention, the lexical tone effect present in the signal is kept by removing the phrase intonation effect and random effect. For Chinese, the lexical tone effect refers to the lexical pronunciation of tone specified within a Chinese syllable. The phrase intonation effect refers to the intonation effect in the pitch contour that is caused by the acoustic realization of a multi-syllable Chinese word. Therefore, according to the invention, the estimated pitch $\hat{F}_0(t)$ is normalized by subtracting speaker and phrase effect. The phrase intonation effect is defined as the long-term tendency of the voiced F₀ contour, which can be approximated by a moving average of the $\hat{F}_0(t)$ contour in the neighborhood of t. Preferably a weighted moving average is used, where advantageously the weights relate to the degree of periodicity of the signal. The phrase intonation effect is removed from the $\hat{F}_0(t)$ contour under the superposition assumption. Experiments confirm this. This gives:

$\hat{F}_0'(t) = \hat{F}_0(t) - \frac{\sum_{\tau=-\Delta T_3}^{+\Delta T_3} \hat{F}_0(t+\tau) \cdot w\left(\hat{F}_0(t+\tau), t+\tau\right)}{\sum_{\tau=-\Delta T_3}^{+\Delta T_3} w\left(\hat{F}_0(t+\tau), t+\tau\right)} \qquad (2)$

In its simplest form, the moving average is estimated with w(f; t)=1, giving a straightforward moving average. Preferably, a weighted moving average is calculated, where advantageously the weight represents the degree of voicing (w(f; t)=v(f; t)). This latter average yields a slightly improved estimate by focusing on clearly voiced regions. Optimal performance of the weighted moving average filter is achieved for a window of approximately 1.0 second.
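As an illustration of equation (2), the following minimal sketch subtracts a weighted moving average from a pitch contour given as one value per frame. The function name `remove_phrase_intonation`, the truncation of the window at the utterance edges, and the handling of an all-zero weight window are assumptions that the description does not specify.

```python
def remove_phrase_intonation(f0, weights, half_window):
    """Subtract the weighted moving average of f0 (equation (2)) frame by frame."""
    out = []
    for t in range(len(f0)):
        lo = max(0, t - half_window)            # assumption: window truncated at the edges
        hi = min(len(f0) - 1, t + half_window)
        w_sum = sum(weights[k] for k in range(lo, hi + 1))
        if w_sum > 0.0:
            avg = sum(f0[k] * weights[k] for k in range(lo, hi + 1)) / w_sum
        else:
            avg = f0[t]  # assumption: no voiced evidence, so the residual is set to zero
        out.append(f0[t] - avg)
    return out

# Hypothetical pitch values in Hz, one per frame.
f0 = [200.0, 210.0, 190.0, 205.0, 195.0]
flat = remove_phrase_intonation([150.0] * 5, [1.0] * 5, 2)   # constant contour
resid = remove_phrase_intonation(f0, [1.0] * 5, 2)           # w(f; t) = 1
```

With unit weights this reduces to the plain moving average; passing the degree of voicing as `weights` gives the preferred weighted variant.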

A preferred method for decomposing the F₀ contour into a tone effect, phrase effect and random effect involves the following steps:

Calculating the normalized correlation of the speech signal, with a time lag corresponding to the reciprocal of the pitch estimate from the look-ahead trace-back procedure,

Smoothing the normalized-correlation contour by a moving average or median filter over neighboring frames (with window size relating to the average length of a syllable).

Preferably, the moving average filter is:

Y-smoothed(t)=(1*y(t−5)+2*y(t−4)+3*y(t−3)+4*y(t−2)+5*y(t−1)+5*y(t)+5*y(t+1)+4*y(t+2)+3*y(t+3)+2*y(t+4)+1*y(t+5))/30
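The smoothing filter can be sketched as follows. The names `TAPS` and `smooth` are illustrative. Note that the eleven weights sum to 35 while the printed divisor is 30, so the filter as printed has a gain of 35/30; the sketch keeps the printed divisor as the default and exposes it as a parameter rather than guessing which value was intended.

```python
TAPS = [1, 2, 3, 4, 5, 5, 5, 4, 3, 2, 1]  # weights for y(t-5) .. y(t+5)

def smooth(y, t, divisor=30.0):
    """Smoothed correlation at frame t; needs 5 frames of context on each side."""
    if t < 5 or t + 5 >= len(y):
        raise IndexError("need 5 frames of context on both sides")
    return sum(w * y[t + k - 5] for k, w in enumerate(TAPS)) / divisor

# Hypothetical constant correlation contour.
y = [0.6] * 11
as_printed = smooth(y, 5)                               # divisor 30 as in the text
unit_gain = smooth(y, 5, divisor=float(sum(TAPS)))      # divisor 35 gives unit gain
```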

Calculating the coefficients of the second order regression of the auto-correlation over neighboring frames (with window size related to the average length of a syllable). Preferably, the calculation of the regression coefficients γ₀, γ₁, γ₂ of the smoothed auto-correlation uses least square criteria over n (n=11) frames. For run-time efficiency, this operation can be skipped and γ₀ can be replaced by smoothed correlation coefficients. A constant data matrix is used:

$\begin{pmatrix} 2n+1 & 0 & \frac{n(n+1)(2n+1)}{3} \\ 0 & \frac{n(n+1)(2n+1)}{3} & 0 \\ \frac{n(n+1)(2n+1)}{3} & 0 & \frac{n(n+1)(2n+1)(3n^{2}+3n-1)}{15} \end{pmatrix}$
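Because the data matrix is constant and has zeros in the odd positions, the normal equations can be solved in closed form. The sketch below assumes that the matrix entries are the sums of l⁰, l² and l⁴ over a symmetric window l = −n..n (so 2n+1 frames in total); the function name `quadratic_fit` and the example half-width of 5 (11 frames) are illustrative choices.

```python
def quadratic_fit(y, t, n):
    """Least-squares gamma0, gamma1, gamma2 of y over frames t-n .. t+n."""
    s0 = 2 * n + 1                                                  # sum of l^0
    s2 = n * (n + 1) * (2 * n + 1) / 3                              # sum of l^2
    s4 = n * (n + 1) * (2 * n + 1) * (3 * n * n + 3 * n - 1) / 15   # sum of l^4
    b0 = sum(y[t + l] for l in range(-n, n + 1))
    b1 = sum(l * y[t + l] for l in range(-n, n + 1))
    b2 = sum(l * l * y[t + l] for l in range(-n, n + 1))
    det = s0 * s4 - s2 * s2       # determinant of the coupled (gamma0, gamma2) block
    g0 = (s4 * b0 - s2 * b2) / det
    g1 = b1 / s2                  # the linear term decouples from the other two
    g2 = (s0 * b2 - s2 * b0) / det
    return g0, g1, g2

# Hypothetical contour that is exactly quadratic around the centre frame 5.
y = [0.5 + 0.02 * (k - 5) - 0.01 * (k - 5) ** 2 for k in range(11)]
g0, g1, g2 = quadratic_fit(y, 5, 5)
```

Since the matrix depends only on n, its entries (and the determinant) can be precomputed once, which is presumably what makes the constant data matrix attractive at run time.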

Alternatively, the calculation of the regression coefficients of the F₀ contour uses weighted least square criteria over n (n=11) frames, with a data matrix which is a function of weights,

$\begin{pmatrix} \sum_{l=-n}^{n} u_{t} & \sum_{l=-n}^{n} u_{t} l & \sum_{l=-n}^{n} u_{t} l^{2} \\ \sum_{l=-n}^{n} u_{t} l & \sum_{l=-n}^{n} u_{t} l^{2} & \sum_{l=-n}^{n} u_{t} l^{3} \\ \sum_{l=-n}^{n} u_{t} l^{2} & \sum_{l=-n}^{n} u_{t} l^{3} & \sum_{l=-n}^{n} u_{t} l^{4} \end{pmatrix}$

where the weights are:

$u_{t} = \begin{cases} 1, & \gamma_{0,t} \geq 0.4 \\ 0, & \gamma_{0,t} \leq 0.1 \\ \gamma_{0,t}, & \text{otherwise} \end{cases}$

Calculating the regression weights of the F₀ contour based on the constant terms of the regression coefficients of the second order regression of the auto-correlation over neighboring frames (with a window size related to the average length of a syllable). Preferably, the calculation of the regression weights is based on the following criterion:

If the constant term $\gamma_{0,t}$ of the regression coefficients of the auto-correlation is greater than 0.40, then the regression weight for this frame t is set at approximately 1.0,

If the constant term $\gamma_{0,t}$ of the regression coefficients of the auto-correlation is less than 0.10, then the regression weight for this frame t is set at approximately 0.0,

Otherwise the regression weight for this frame t is set at the constant term of the regression coefficients of the auto-correlation. For the weighted regression and weighted long-term moving average filter preferably the following weights are used:

$u_{t} = \begin{cases} 1, & \gamma_{0,t} \geq 0.4 \\ 0, & \gamma_{0,t} \leq 0.1 \\ \gamma_{0,t}, & \text{otherwise} \end{cases}$
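The weighting criterion amounts to a three-way threshold on the constant regression term. A minimal sketch, with the hypothetical function name `regression_weight` and the thresholds exposed as parameters, could look like this:

```python
def regression_weight(gamma0, upper=0.4, lower=0.1):
    """Map the constant regression term of the auto-correlation to a weight u_t."""
    if gamma0 >= upper:
        return 1.0      # clearly voiced frame: full weight
    if gamma0 <= lower:
        return 0.0      # clearly unvoiced frame: ignored
    return gamma0       # in between: weight equals the degree of voicing

# Hypothetical per-frame constant terms gamma_{0,t}.
gammas = [0.05, 0.1, 0.25, 0.4, 0.9]
weights = [regression_weight(g) for g in gammas]
```

The same weights would then be used both for the weighted regression and for the long-term weighted moving average.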

Calculating the phrase intonation component of the Mandarin Chinese speech prosody by a long-term weighted-moving-average or median filter. Preferably, the window size relates to the average length of a phrase and the weights relate to the regression weights of the F₀ contour. Advantageously, the window length of the long-term weighted-moving-average filter for extracting the phrase intonation effect is set in the range of approximately 0.80 to 1.00 seconds.

Calculating the coefficients of the second order weighted regression of the de-intonated pitch contour, obtained by subtracting the phrase intonation effect, over neighboring frames (with window size related to the average length of a syllable).
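The weighted second-order regression of the last step can be sketched by solving the 3×3 weighted normal equations directly. The sketch assumes that each frame t+l in the window contributes its own weight (written `u[t + l]` here), solves the system with Cramer's rule for brevity, and uses the hypothetical names `det3` and `weighted_quadratic_fit`.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def weighted_quadratic_fit(f0, u, t, n):
    """Weighted least-squares quadratic fit of f0 over frames t-n .. t+n."""
    ls = range(-n, n + 1)
    # Weighted power sums: s[k] = sum of u * l^k, b[k] = sum of u * l^k * f0
    s = [sum(u[t + l] * l ** k for l in ls) for k in range(5)]
    b = [sum(u[t + l] * l ** k * f0[t + l] for l in ls) for k in range(3)]
    a = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("too few weighted frames for a quadratic fit")
    coeffs = []
    for col in range(3):                  # Cramer's rule, one column at a time
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)

# Hypothetical de-intonated contour: quadratic in the voiced frames, with two
# arbitrary pitch values in zero-weight (unvoiced) frames at the window edges.
f0 = [3.0 + 0.5 * l - 0.2 * l * l for l in range(-5, 6)]
u = [1.0] * 11
f0[0], u[0] = 400.0, 0.0
f0[10], u[10] = -250.0, 0.0
c0, c1, c2 = weighted_quadratic_fit(f0, u, 5, 5)
```

The zero-weight frames do not influence the fit, which illustrates how the weights keep random pitch estimates in unvoiced frames from distorting the tone features.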

As described above, the F₀ contour is decomposed into lexical tone effect, phrase intonation effect, and random effect by means of a time-variant, weighted moving average (MA) filter in conjunction with weighted (placing more emphasis on vowels) least squares of the F₀ contour. Since the lexical tone effect only exists in the voiced segments of Chinese syllables, the voiced-unvoiced ambiguity is resolved by the introduction of the weighted regression over neighboring frames, with a window size related to the average length of a syllable and weights that depend on the degree of periodicity.

FIG. 3A shows a least squares fit of the F₀ contour of a sentence. FIG. 3B shows the same contour after applying the weighted moving average (WMA) filter with weighted least squares (WLS). The phrase intonation effect is estimated by the WMA filter. The tone effect corresponds to the constant terms of the WLS of the F₀ contour minus the phrase intonation effect. The following table illustrates that the phrase intonation effect can be ignored.

(LTNlookahead, LTNlookback)    TER / TER reduction    CER / CER reduction
(0, 0)                         22.94%                 12.23%
(40, 40)                       20.51%                 12.07%
(50, 50)                       20.19%                 12.12%
(60, 60)                       20.35%                 12.05%

(traceback delay=20, correlation smoothing radius=5, frame width=0.032)

(Lexical Modelling: Tonal Preme/Core-Final in training)

(phrase trigram LM)

The optimal window of the WMA filter is experimentally determined to be around 1.0 second (as shown in the above table), which can symmetrically cover rising and falling tones in most cases.

The following two tables illustrate that asymmetry negatively affects the TER (tone error rate). This is also the reason why WMA is not only a normalization factor for F₀, but also a normalization factor for phrase.

(LTNlookahead, LTNlookback)    TER / TER reduction    CER / CER reduction
(50, 50)                       20.19%                 12.12%
(25, 25)                       21.29%                 12.08%
(25, 75)                       21.57%                 12.07%
(25, 50)                       21.09%                 12.19%

(traceback delay=20, correlation smoothing radius=5, frame width=0.032)

(Lexical Modelling: Tonal Preme/Core-Final in training)

(phrase trigram LM)

(LTNlookahead, LTNlookback)    TER / TER reduction          CER / CER reduction
(50, 50)                       23.54% (1691) (baseline)     12.60% (905) (baseline)
(25, 25)                       25.27% (1816) (+7.33%)       12.57% (903) (−0.22%)
(25, 75)                       25.12% (1805) (+6.67%)       12.75% (916) (+1.22%)
(25, 50)                       24.41% (1754) (+3.66%)       12.72% (914) (+0.99%)

(traceback delay=20, correlation smoothing radius=5, frame width=0.032)

(Lexical Modelling: Preme/Core-Final in training)

(phrase trigram LM)

Extracting Temporal Properties of Voiced Pitch Movements

By means of second order regression of the auto-correlation, voicing information is extracted from the speech signal. If the constant term of the regression coefficients of the auto-correlation is greater than a given threshold, say 0.4, then the regression weight for this frame is set at 1.0. If the constant term of the regression coefficients of the auto-correlation is less than a given threshold, say 0.10, then the regression weight for this frame is set at 0.0. Otherwise it is set at the constant term of the regression coefficients of the auto-correlation. These weights are applied to the above second order weighted regression of the de-intonated F₀ contour and to the long-term weighted-moving-average or median filter of the phrase intonation component of the Mandarin Chinese speech prosody. The weights of the second order weighted regression of the de-intonated F₀ contour are designed to emphasize/de-emphasize the voiced/unvoiced segments of the pitch contour in order to preserve the voiced pitch contour for the semi-voiced consonants. The advantage of this mechanism is that, even if the speech segmentation has slight errors, these weights, with the look-ahead adaptive-pruning trace back of the F₀ contour serving as the on-line signal pre-processing front-end, will preserve the pitch contour of the vowels for the pitch contour of the consonants. This vowel-preserving property of the tone features prevents biased estimation of the model parameters due to speech segmentation errors.

By using a second order regression of the auto-correlation with lags corresponding to the reciprocal of the output of the look-ahead adaptive-pruning trace back of F₀, periodicity information is extracted from the speech signal. First the extracted pitch profile is processed using the pitch dynamic time-warping (PDT) technique in order to get a smoothed pitch contour (with nearly no multiple pitch errors), then second-order weighted least squares are applied in order to extract the profiles of the pitch contour. Such profiles are represented by the regression coefficients. The constant regression coefficient is used for calculating the weights required in the decomposition of the F₀ contour. The first and second regression coefficients are used for further reduction of the tone error rate. The best setting for the window is around 110 ms, which is less than one syllable's length at a normal speaking rate.

Generation of a Pseudo Feature Vector

According to the criterion of maximizing the local likelihood scores, pseudo feature vectors are generated for unvoiced segments of speech in order to prevent biased estimation of the model parameters in the HMM. This is done by first calculating the sum of the regression weights within a regression window. For a sum of weights less than a predefined threshold (e.g. 0.25), the normalized features are replaced by pseudo features generated according to the least squares criterion (falling back to the degenerate case of equally weighted regression).
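The fall-back decision can be sketched as a check on the weight sum of the regression window. The function name `window_weights`, its return convention, and the example weight values are assumptions for illustration; only the threshold of 0.25 and the fall-back to equal weights come from the text.

```python
def window_weights(u, t, n, threshold=0.25):
    """Return (weights, is_pseudo) for the regression window t-n .. t+n.

    If the summed regression weights fall below the threshold, the window is
    treated as unvoiced and equal weights are used instead (degenerate case).
    """
    if t - n < 0 or t + n >= len(u):
        raise IndexError("regression window must lie inside the utterance")
    window = u[t - n:t + n + 1]
    if sum(window) < threshold:
        return [1.0] * len(window), True    # pseudo features: equally weighted fit
    return list(window), False              # normal case: keep the voicing weights

# Hypothetical weight sequences for a silent and a voiced stretch.
silent = [0.0] * 11
voiced = [0.0, 0.2, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.3, 0.0, 0.0]
w_silent, pseudo_silent = window_weights(silent, 5, 5)
w_voiced, pseudo_voiced = window_weights(voiced, 5, 5)
```

The returned weights would then be passed to the same quadratic regression used for voiced frames, so that every frame yields a feature vector of the same form.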

For clear silence regions, the local minimum path in look-ahead trace back produces random values for pitch estimates. Such a de-intonated F₀ estimate and its derivatives have zero mean under the assumption of a priori equally distributed normalized features over neighboring frames and a symmetrical probability distribution of the normalized features, with a minimal variance that ensures a non-degenerate probability distribution in each state of HMM-based acoustic modeling. Since it is difficult to draw a clear line between voiced and unvoiced regions in units of milliseconds, in the voiced-unvoiced region equally weighted regression is employed to smooth both the traceable pitch in clearly voiced segments and the random pitch in clear silence regions.

Tone Component

As described above, in a preferred embodiment, the tone component is defined as the locally weighted regression of the de-intonated pitch contour over, preferably, 110 msec., which is less than one syllable length (in fact, approximately one average vowel length), in order to avoid modeling the within-phase pitch contour. The weights in the local regression are designed to emphasize/de-emphasize the voiced/unvoiced segments of the pitch contour in order to preserve the voiced pitch contour for the consonants (initial/preme). The main advantage of this mechanism is that, even if the speech segmentation has slight errors (it does not recognize a small amount of the unvoiced as voiced), these weights will preserve the pitch contour of the vowel (final/toneme) and carry it over into the initial/premes. In this way, statistics of the statistical models are accumulated in the training process and later in the recognition process. Moreover, it allows simulating scores for the initial/preme so that speech segmentation errors do not hurt tone recognition.

Experimental Setup

The experiments have been performed using a Philips large-vocabulary continuous-speech recognition system, which is an HMM-based system using standard MFCC features with first-order derivatives, sentence-based cepstral mean subtraction (CMS) for simple channel normalization, and Gaussian mixture densities with density-specific diagonal covariance matrices. Experiments were conducted on three different Mandarin continuous-speech corpora: the MAT corpus (telephone, Taiwan Mandarin), a non-public PC dictation database (microphone, Taiwan Mandarin), and the database of the 1998 Mainland Chinese 863 benchmarking. For the MAT and the PC dictation database, a speaker-independent system is used. For 863, a separate model is trained for each gender, and the gender is known during decoding. The standard 863 language-model training corpus (People's Daily 1993-4) contains the test set. Thus, the system already “knows” the entirety of the test sentences, which does not reflect the real-life dictation situation. To obtain realistic performance figures, the LM training set has been “cleaned” by removing all 480 test sentences. The following table summarizes the corpus characteristics.

Type                MAT                  PC Dictation         863
                    Train      Test      Train      Test      Train      Test
#Speakers           72126                241        20        2 × 83     N/a
#Utterances         28896      259       27606      200       92948      2 × 240
#Syl./Utt.          5.66       14.2      30.1       35.5      12.1       12.6
TPP                 -          3.37      -          3.54      -          3.50
Lexicon size        -          42038     -          42038     -          56064
CPP_(bi)            -          121.8     -          63.6      -          53.4
CPP_(tri)           -          106.1     -          51.1      -          41.3
CPP_(tri,inside)    -          -         -          -         -          14.4

PDAs are often assessed with respect to fine and gross pitch errors. Since it is assumed that the underlying existing algorithm has been extensively tuned, and the focus is on integration with speech recognition, the system has been optimized with respect to the tone error rate (TER) instead. All tables except the last one show TER. TER is measured by tonal-syllable decoding, where the decoder is given the following information for each syllable:

start and end frame (obtained by forced alignment),

base-syllable identity (toneless, from the test script), and

the set of tones allowed for this particular syllable.

Not all five lexical tones can be combined with all Chinese syllables. The tone perplexity (TPP) has been defined as the number of possible tones for a syllable averaged over the test set.

The first column in the following experiment tables shows the experiment Ids (D1, D2, T1, etc.), which are intended to help quickly identify identical experiments shown in more than one table.
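The tone perplexity as defined here is a plain average of the number of allowed tones. The sketch below uses the hypothetical function name `tone_perplexity` and made-up placeholder tone sets; it does not reflect real syllable inventories.

```python
def tone_perplexity(allowed_tones_per_syllable):
    """Average number of allowed tones per syllable occurrence in a test set."""
    counts = [len(tones) for tones in allowed_tones_per_syllable]
    if not counts:
        raise ValueError("test set must contain at least one syllable")
    return sum(counts) / len(counts)

# Placeholder test set: one set of allowed tone labels per syllable occurrence.
test_set = [{1, 2, 3, 4}, {1, 4}, {1, 2, 3, 4, 5}]
tpp = tone_perplexity(test_set)
```

A TPP near 3.5, as reported in the corpus table, would mean that the decoder on average chooses among three to four tones per syllable rather than all five.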

Real-time/online DP Operation

The first experiments deal with the benefit of using Dynamic Programming at all. The following table shows a 10-15% TER reduction from DP for MAT and PCD. Only for the very clean 863 corpus is DP not required. Since a real-life dictation system also has to deal with noise, DP is considered useful in any case to assure robustness.

Id    Pitch extractor    MAT      PC       863      Gain
D1    SHS only           32.0%    21.4%    24.0%    b/l
D2    SHS + DP           27.0%    19.2%    24.3%    8.4%

The second set of experiments considers the benefits of partial traceback. Intuitively, the joint information of one syllable should be sufficient, i.e. around 20-25 frames. The following table shows that 10 frames are already enough to stabilize the pitch contour. Conservatively, 15 frames may be chosen.

Id    Traceback length         MAT      PC       863      Loss
D2    Whole sentence           27.0%    19.2%    24.3%    b/l
T1    20 frames (200 msec.)    28.3%    19.7%    24.4%    2.8%
T2    15 frames (150 msec.)    28.0%    20.0%    24.3%    2.9%
T3    10 frames (100 msec.)    28.5%    19.6%    24.2%    2.6%

Focusing on reducing the search effort, the following table shows the number of path recombinations (corpus average) for beam-pruning with different pruning thresholds. A 93% reduction at minimal increase of tone error rate can be achieved (P3). Conservatively, setup P2 may be chosen.

Id    Threshold    Recomb.    MAT      PC       863      Loss
T2    0                       28.0%    20.0%    24.3%    0%
P1    0.99         681        28.4%    21.0%    23.9%    1.5%
P2    0.999        413        29.0%    20.2%    24.4%    1.7%
P3    0.9999       305        28.6%    20.2%    24.7%    1.4%

Reducing the resolution from 48 quantization steps per octave to only 16 yields another vast reduction of path recombinations, but leads to some degradation (experiment R1 in the following table). This can be alleviated by fine-tuning the pitch after DP (R2).

Id    Quantization    Recomb.    MAT      PC       863      Loss
P2    48              413        29.0%    20.2%    24.4%    b/l
R1    16              99         28.7%    21.8%    25.6%    3.9%
R2    16, tuned       99         29.4%    20.8%    24.5%    1.5%

Experimental Results for the Tonal Feature Vector

Experiments have been performed to verify improvements to the feature vector according to the invention. The tests were started with a conventional feature vector $\bar{o}(t) = (\hat{F}_0(t); \Delta\hat{F}_0(t))$. The following table shows that almost the entire performance is due to $\Delta\hat{F}_0(t)$. Switching off $\hat{F}_0(t)$ has only a minor effect (F2), while using it as the only feature leads to a dramatic degradation of 52% (F3). Taking the log has no significant effect (F4).

Id    Tone features                                       MAT      PC       863      Gain
F1    $\hat{F}_0(t)$; $\Delta\hat{F}_0(t)$                37.1%    28.2%    29.9%    b/l
F2    $\Delta\hat{F}_0(t)$ only                           37.3%    28.8%    30.1%    −1.2%
F3    $\hat{F}_0(t)$ only                                 48.7%    49.8%    44.3%    −52%
F4    log $\hat{F}_0(t)$; log $\Delta\hat{F}_0(t)$        36.5%    28.3%    29.8%    0.4%

The following table shows the effect of normalization, i.e. the effectiveness of eliminating speaker and phrase effect by subtracting the averaged neighborhood pitch (with the weight w(f, t)=1 in equation (2)). Of the three different window widths (a moving average of 0.6 sec., 1.0 sec. and 1.4 sec., respectively), the 1-second window wins by a small margin.

Id    Normalization          MAT      PC       863      Gain
F1    None                   37.1%    28.2%    29.9%    b/l
N1    Moving av. 0.6 sec.    33.0%    25.7%    29.7%    6.8%
N2    Moving av. 1.0 sec.    32.1%    25.9%    29.1%    8.0%
N3    Moving av. 1.4 sec.    32.2%    26.5%    29.6%    6.8%

The following table compares normalizing log $\hat{F}_0(t)$ with a moving average window of 1.0 sec. to normalizing to the sentence mean. Both the MAT and the 863 corpus consist of short utterances, with little phrase effect. Thus, for MAT, sentence-based normalization performs equally to the proposed method. For 863 on the other hand, where the gender bias is already accounted for by the gender-dependent models, no improvements are obtained over the unnormalized case. For the PC Dictation corpus, with long utterances and strong phrase effect, sentence-based normalization yields no improvement either.

Id    Normalization          MAT      PC       863      Gain
F4    None                   36.5%    28.3%    29.8%    b/l
N4    Moving av. 1.0 sec.    33.3%    24.8%    28.7%    8.3%
N5    Sentence mean          33.2%    28.6%    30.1%    2.4%

The following table shows the effect of using the 2nd-order derivative $\Delta\Delta\hat{F}_0(t)$. A significant improvement of 9% is observed, where the microphone setups benefit most.

Id    $\Delta\Delta\hat{F}_0(t)$    MAT      PC       863      Gain
N2    No                            32.1%    25.9%    29.1%    b/l
F5    Yes                           30.7%    22.9%    25.9%    9.0%

The following table shows that using voicing v(f; t) as a feature results in a gain of 4.5%, which can be further tuned to 6.4% by simple smoothing to reduce noise.

Id    Voicing feature      MAT      PC       863      Gain
F5    None                 30.7%    22.9%    25.9%    b/l
V1    v(f; t) raw          29.9%    20.8%    25.5%    4.5%
V2    v(f; t) smoothed     29.1%    20.7%    24.8%    6.4%

Another 6.1% is achieved from the derivative of the smoothed voicing, but no further reduction is obtained from the 2nd derivative, as illustrated in the following table.

Id    Voicing feature                                   MAT      PC       863      Gain
V2    v(f; t) smoothed                                  29.1%    20.7%    24.8%    6.4%
V3    v(f; t) smoothed, plus 1st derivative             27.0%    19.5%    23.5%    6.1%
V4    v(f; t) smoothed, plus 1st and 2nd derivative     27.7%    19.7%    23.7%    4.5%

A final small improvement (2.5%) is obtained by using v(f; t) as the weight in local normalization, as shown in the following table.

Id    Normalization    MAT      PC       863      Gain
V3    Unweighted       27.0%    19.5%    23.5%    6.1%
N6    Weighted         26.2%    19.0%    23.0%    2.5%

Taking all of the above optimization steps with respect to the feature vector together (from experiment F1 to N6), an average TER improvement of 28.4% has been achieved compared to the starting vector $\bar{o}(t) = (\hat{F}_0(t); \Delta\hat{F}_0(t))$.
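The 28.4% figure can be reproduced from the tabulated tone error rates of experiments F1 and N6, assuming that the reported gain is the relative TER reduction averaged over the three corpora. That reading is an assumption, as is the helper name `average_relative_reduction`.

```python
def average_relative_reduction(baseline, improved):
    """Mean over corpora of the relative error-rate reduction, in percent."""
    gains = [(baseline[c] - improved[c]) / baseline[c] for c in baseline]
    return 100.0 * sum(gains) / len(gains)

# Tone error rates (in %) of experiments F1 and N6 from the tables above.
f1 = {"MAT": 37.1, "PC": 28.2, "863": 29.9}
n6 = {"MAT": 26.2, "PC": 19.0, "863": 23.0}
avg_gain = average_relative_reduction(f1, n6)
```

The per-corpus reductions are roughly 29.4% (MAT), 32.6% (PC) and 23.1% (863), whose mean rounds to the stated 28.4%.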

Combination with Language Model

Experiments have also confirmed that an optimal tone error rate also leads to the best overall system performance. To show this, character error rates (CER) of the integrated system have been measured for selected setups, using a phrase-based recognition lexicon and a phrase-bigram/trigram language model. For completeness and comparability, the last two rows of the following table show results obtained with the test set inside (“System performance test”).

Id    Tone features                               MAT      PC       863      Gain
Bigram
-     No tone model                               42.4%    18.9%    11.6%    b/l
F1    $\hat{F}_0(t)$; $\Delta\hat{F}_0(t)$        38.6%    14.5%    9.5%     17.0%
N2    + $\hat{F}_0(t)$ normalization              36.4%    13.7%    9.7%     19.5%
F5    + $\Delta\Delta\hat{F}_0(t)$                35.0%    13.3%    8.6%     24.3%
V3    + voicing features                          34.4%    12.6%    8.3%     26.9%
N6    + weighting                                 34.2%    12.9%    8.1%     27.3%
Trigram
-     no tone model                               40.4%    16.4%    10.4%    b/l
N6    best tone model                             33.1%    12.0%    7.3%     25.0%
863 benchmark: Trigram, test-set inside LM training
-     no tone model                               -        -        3.8%     b/l
N6    best tone model                                               3.4%     10.6%

The outcome confirms the good correspondence between TER and CER. Secondly, the overall relative CER improvement from tone modeling reaches an extraordinary 27.3% on average (bigram), with the smallest gain on telephone speech (19.3%), and exceeding 30% for the two microphone corpora. For trigram, gains are slightly smaller because the trigram can disambiguate more cases from the linguistic context alone, for which the bigram requires the tone model's assistance. (The extreme case is the 863 benchmarking LM, with the test set inside LM training, where most tones are deduced correctly from the context and tone modeling helps by 10.6%.)

Summary

Important for constructing on-line, robust tone feature extraction is to use the joint, local information of periodicity in the neighborhood of the concurrent voiced time frame. The present invention eliminates determining tone features directly from marginal information of periodicity at the concurrent time frame. Instead, the degree of voicing is treated as the distribution of the fundamental frequency.

Important aspects of robust feature extraction, which may also be used in combination with conventional techniques, are:

Extracting pitch information by determining a measure inside the speech signal, preferably based on Subharmonic Summation,

On-line look-ahead adaptive pruning trace back of the fundamental frequency, where the adaptive pruning is based on the degree of voicing and the joint information of preferably the last 0.50 s,

Removing phrase intonation, which is defined as the long-term tendency of the voiced F₀ contour. This effect is approximated by a weighted moving average of the F₀ contour, with weights preferably related to the degree of periodicity of the signal,

Second order weighted regression of the de-intonated F₀ contour over certain time frames, where the maximal window length corresponds to the length of a syllable, with weights related to the degree of periodicity of the signal,

Second order regression of the auto-correlation over certain time frames, where the maximal window length corresponds to the length of a syllable, with a time lag corresponding to the reciprocal of the pitch estimate from the look-ahead trace-back procedure, and

Generation of a pseudo feature in voiced-unvoiced segments of the speech signal. Pseudo feature vectors are generated for unvoiced speech according to the least squares criterion (falling back to the degenerate case of equally weighted regression).

FIG. 4 is a block diagram illustrating a speech recognition system 450 for recognizing a time-sequential input signal 400 representing speech spoken in a tonal language. The system 450 includes an input 410 for receiving the input signal; a speech analysis subsystem 420 for representing a segment of the input signal as an observation feature vector; and a unit matching subsystem 430 for matching the observation feature vector against an inventory of trained speech recognition units, each unit being represented by at least one reference feature vector. The feature vector includes a component derived from an estimated degree of voicing of the speech segment represented by the feature vector. Unvoiced segments of speech are represented by a pseudo feature vector.

What is claimed is:
 1. A speech recognition system for recognizing a time-sequential input signal representing speech spoken in a tonal language; the system including: an input for receiving the input signal; a speech analysis subsystem for representing a segment of the input signal as an observation feature vector; and a unit matching subsystem for matching the observation feature vector against an inventory of trained speech recognition units, each unit being represented by at least one reference feature vector; wherein the feature vector includes a component derived from an estimated degree of voicing of the speech segment represented by the feature vector and wherein unvoiced segments of speech are represented by a pseudo feature vector.
 2. A speech recognition system as claimed in claim 1, wherein the derived component represents the estimated degree of voicing of the speech segment.
 3. A speech recognition system as claimed in claim 1, wherein the derived component represents a derivative of the estimated degree of voicing of the speech segment.
 4. A speech recognition system as claimed in claim 1, wherein the estimated degree of voicing is smoothed.
 5. A speech recognition system as claimed in claim 1, wherein the degree of voicing is a measure of a short-time auto-correlation of an estimated pitch contour.
 6. A speech recognition system as claimed in claim 5, wherein the measure is formed by the regression coefficients of the auto-correlation contour.
 7. A speech recognition system as claimed in claim 5, wherein the estimated pitch is obtained by removing a phrase intonation effect from an estimated pitch contour representing the speech segment.
 8. A speech recognition system as claimed in claim 7, wherein the phrase intonation effect is represented by a weighted moving average of the estimated pitch contour.
 9. A speech recognition system as claimed in claim 8, wherein a weight of the weighted moving average represents the degree of voicing in the segment.
 10. A speech recognition system as claimed in claim 1, wherein the feature vector includes a component representing a derivative of an estimated pitch of the speech segment.
 11. A speech recognition system as claimed in claim 1, wherein a segment is considered unvoiced if a sum of regression weights of an estimated pitch contour within a regression window.
 12. A speech recognition system as claimed in claim 1, wherein the pseudo feature vector includes pseudo features generated according to a least squares criterion.
 13. A method for recognizing a time-sequential input signal representing speech spoken in a tonal language; the method comprising the steps of: receiving the input signal; representing a segment of the input signal as an observation feature vector; and matching the observation feature vector against an inventory of trained speech recognition units, each unit being represented by at least one reference feature vector; wherein the feature vector includes a component derived from an estimated degree of voicing of the speech segment represented by the feature vector and wherein unvoiced segments of speech are represented by a pseudo feature vector.
 14. A method as claimed in claim 13, wherein the degree of voicing is a measure of a short-time auto-correlation of an estimated pitch contour.
 15. A method as claimed in claim 14, wherein the estimated pitch is obtained by removing a phrase intonation effect from an estimated pitch contour representing the speech segment.
 16. A method as claimed in claim 15, wherein the phrase intonation effect is represented by a weighted moving average of the estimated pitch contour.
 17. A method as claimed in claim 16, wherein a weight of the weighted moving average represents the degree of voicing in the segment.
 18. A method as claimed in claim 13, wherein a segment is considered unvoiced if a sum of regression weights of an estimated pitch contour within a regression window.
 19. A method as claimed in claim 13, wherein the pseudo feature vector includes pseudo features generated according to a least squares criterion.